Note: Please see revised data frame as rev_coral_join.csv that prioritizes clade C.
read_csv("../output/coral_data.csv") -> corals
## Rows: 658 Columns: 5897
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): sample_id, Clade, Majority, species, region, reef
## dbl (5891): its2_count, ITS2_f, ASV0001, ASV0002, ASV0003, ASV0004, ASV0005,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#ITS2_f should be a factor
corals %>%
mutate(ITS2_f = as.factor(ITS2_f)) -> corals
dim(corals)
## [1] 658 5897
corals %>%
group_by(Clade) %>%
summarise(n = n())
## # A tibble: 2 × 2
## Clade n
## <chr> <int>
## 1 A 606
## 2 C 52
Clade A has 606 observations. 606/(606+52) = about 92 %
Clade C has 52 observations. 52/(606+52) = about 8 %
Notes: If we are trying to explain the variability of clades based on bacteria, logistic regression would be used to explain or predict either clade A or C with bacteria ASVs as columns; however, taking out other info (such as region) is not advisable as they are all possible predictors.
How correlated are the top 2 ASVs with each of the algae variables?
Logging the ASV counts presents better visualization:
corals %>%
ggplot(aes(x = log(ASV0001), y = log(ASV0002), color = Clade)) +
geom_point(position = "jitter") +
facet_wrap(~species)
Clade A is far more prevalent as is also shown in the number of abservations per Clade based on these data.
Distributions are different per species. ASV1 and 2 are more clustered for species P and more spread out in terms of counts for species S.
From coseq() K-means clustering, this is perhaps why ASV1 and 2 were consistently clustered together in species P but not for species S. Perhaps ASVs 1 and 2 play a different role in species S.
Recommendation to reduce dimensionality: After the coral data are split by species, client approved subsetting the first 1000 ASVs.
Recommendation for binary classification: Use Principal Components Analysis (PCA) in the file ‘PCA_CLR.Rmd’ to obtain the most contributing ASVs per species to make sure they are in the top 1000 (or fewer if reduced again later for computational limits).
corals %>%
filter(species =="P") %>% # 296: cut off will be 293 inclusive
select(-c(1:8)) -> corals_p # saving in order to pass to map() function
### These are the ASVs that are found in at least 3 samples or more for species P
map(corals_p, ~sum(.==0)) %>%
as_tibble() %>%
pivot_longer(cols = everything(),
names_to = "ASV_column",
values_to = "sum_col_eq_0") %>%
arrange(sum_col_eq_0) %>%
# cut off includes 293 or more because 296-293 = 3
filter(sum_col_eq_0 >=293) -> asvs_speciesP
map(corals_p, ~sum(.==0)) %>%
as_tibble() %>%
pivot_longer(cols = everything(),
names_to = "ASV_column",
values_to = "sum_col_eq_0") %>%
# cut off includes 293 or more because 296-293 = 3
filter(sum_col_eq_0 == 0)
## # A tibble: 1 × 2
## ASV_column sum_col_eq_0
## <chr> <int>
## 1 ASV0003 0
# use original data frame to get other columns back into data subset
corals %>%
filter(species == "P") %>%
# take out asv 3 bc we know it's not found in this species
# take out species column
select(-ASV0003, -species) %>%
group_by(Clade ) %>%
summarise(n = n())
## # A tibble: 2 × 2
## Clade n
## <chr> <int>
## 1 A 283
## 2 C 13
corals %>%
filter(species =="S") %>% # 362: cut off will be 362-3 = 359 inclusive
select(-c(1:8)) -> corals_s # saving here in order to pass to map() function
### These are the ASVs that are found in at least 3 samples or more for species S
map(corals_s, ~sum(.==0)) %>%
as_tibble() %>%
pivot_longer(cols = everything(),
names_to = "ASV_column",
values_to = "sum_col_eq_0") %>%
arrange(sum_col_eq_0) %>%
# cut off includes 293 or more because 296-293 = 3
filter(sum_col_eq_0 >=359) -> asvs_speciesS
map(corals_s, ~sum(.==0)) %>%
as_tibble() %>%
pivot_longer(cols = everything(),
names_to = "ASV_column",
values_to = "sum_col_eq_0") %>%
# cut off includes 359 or more
filter(sum_col_eq_0 == 0)
## # A tibble: 0 × 2
## # ℹ 2 variables: ASV_column <chr>, sum_col_eq_0 <int>
dim(corals_s)
## [1] 362 5889
head(corals_s)
## # A tibble: 6 × 5,889
## ASV0001 ASV0002 ASV0003 ASV0004 ASV0005 ASV0006 ASV0007 ASV0008 ASV0009
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2544 0 0 0 0 8018 920 0 847
## 2 38 39 0 55 0 23052 0 0 0
## 3 113 0 0 0 0 270 0 0 0
## 4 28881 0 0 0 0 21607 8200 0 6895
## 5 0 0 64 0 0 10263 0 0 0
## 6 10306 0 0 0 0 611 3533 0 2900
## # ℹ 5,880 more variables: ASV0010 <dbl>, ASV0011 <dbl>, ASV0012 <dbl>,
## # ASV0013 <dbl>, ASV0014 <dbl>, ASV0015 <dbl>, ASV0016 <dbl>, ASV0017 <dbl>,
## # ASV0018 <dbl>, ASV0019 <dbl>, ASV0020 <dbl>, ASV0022 <dbl>, ASV0023 <dbl>,
## # ASV0024 <dbl>, ASV0025 <dbl>, ASV0027 <dbl>, ASV0030 <dbl>, ASV0031 <dbl>,
## # ASV0032 <dbl>, ASV0033 <dbl>, ASV0034 <dbl>, ASV0035 <dbl>, ASV0036 <dbl>,
## # ASV0037 <dbl>, ASV0039 <dbl>, ASV0040 <dbl>, ASV0041 <dbl>, ASV0042 <dbl>,
## # ASV0043 <dbl>, ASV0044 <dbl>, ASV0046 <dbl>, ASV0047 <dbl>, …
Code adapted from here
library(Rtsne)
corals %>%
select(starts_with("ASV")[1:1000]) %>%
as.matrix() -> asv_matrix
library(tsne)
library(plotly)
data("iris")
features <- subset(iris, select = -c(Species))
set.seed(0)
tsne <- tsne(features, initial_dims = 2)
## sigma summary: Min. : 0.376979658833158 |1st Qu. : 0.45299119244845 |Median : 0.509480199794486 |Mean : 0.520650714341092 |3rd Qu. : 0.579571467464058 |Max. : 0.758492715638686 |
## Epoch: Iteration #100 error is: 10.6877963361652
## Epoch: Iteration #200 error is: 0.0730593518613691
## Epoch: Iteration #300 error is: 0.0688210768935925
## Epoch: Iteration #400 error is: 0.0677845552661532
## Epoch: Iteration #500 error is: 0.0674575742993366
## Epoch: Iteration #600 error is: 0.0672632485595604
## Epoch: Iteration #700 error is: 0.0671387737166583
## Epoch: Iteration #800 error is: 0.0670575761754511
## Epoch: Iteration #900 error is: 0.0670020352301366
## Epoch: Iteration #1000 error is: 0.0669657774358675
tsne <- data.frame(tsne)
pdb <- cbind(tsne,iris$Species)
options(warn = -1)
fig <- plot_ly(data = pdb ,x = ~X1, y = ~X2, type = 'scatter', mode = 'markers', split = ~iris$Species)
fig <- fig %>%
layout(
plot_bgcolor = "#e5ecf6"
)
fig
dim(asv_matrix)
## [1] 658 1000
library(tsne)
library(plotly)
features <- subset(asv_matrix) # 658 x 1000
set.seed(0)
tsne <- tsne(features, initial_dims = 2)
## sigma summary: Min. : 0.0717351762735381 |1st Qu. : 0.14417917387887 |Median : 0.25304682961265 |Mean : 0.259781738918223 |3rd Qu. : 0.331287316162124 |Max. : 0.881559216277751 |
## Epoch: Iteration #100 error is: 9.5805974100228
## Epoch: Iteration #200 error is: 0.215750400209934
## Epoch: Iteration #300 error is: 0.167603391129179
## Epoch: Iteration #400 error is: 0.152457463386035
## Epoch: Iteration #500 error is: 0.146676402504261
## Epoch: Iteration #600 error is: 0.143524659317524
## Epoch: Iteration #700 error is: 0.14148496677019
## Epoch: Iteration #800 error is: 0.140058690361463
## Epoch: Iteration #900 error is: 0.1390071568364
## Epoch: Iteration #1000 error is: 0.138210903902764
tsne <- data.frame(tsne)
pdb <- cbind(tsne,corals$species)
options(warn = -1)
fig <- plot_ly(data = pdb ,x = ~X1, y = ~X2, type = 'scatter', mode = 'markers', split = ~corals$species)
fig <- fig %>%
layout(
plot_bgcolor = "#e5ecf6"
)
fig
library(tsne)
library(plotly)
features <- subset(asv_matrix)
set.seed(0)
tsne <- tsne(features, initial_dims = 2)
## sigma summary: Min. : 0.0717351762735381 |1st Qu. : 0.14417917387887 |Median : 0.25304682961265 |Mean : 0.259781738918223 |3rd Qu. : 0.331287316162124 |Max. : 0.881559216277751 |
## Epoch: Iteration #100 error is: 9.5805974100228
## Epoch: Iteration #200 error is: 0.215750400209934
## Epoch: Iteration #300 error is: 0.167603391129179
## Epoch: Iteration #400 error is: 0.152457463386035
## Epoch: Iteration #500 error is: 0.146676402504261
## Epoch: Iteration #600 error is: 0.143524659317524
## Epoch: Iteration #700 error is: 0.14148496677019
## Epoch: Iteration #800 error is: 0.140058690361463
## Epoch: Iteration #900 error is: 0.1390071568364
## Epoch: Iteration #1000 error is: 0.138210903902764
tsne <- data.frame(tsne)
pdb <- cbind(tsne,corals$Clade)
options(warn = -1)
fig2 <- plot_ly(data = pdb ,x = ~X1, y = ~X2, type = 'scatter', mode = 'markers', split = ~corals$Clade)
fig2 <- fig2 %>%
layout(
plot_bgcolor = "#e5ecf6"
)
fig2
# library(tsne)
# library(plotly)
#
#
#
# features <- as.matrix(corals_p, select..........)
#
# set.seed(0)
# tsne <- tsne(features, initial_dims = 2)
# tsne <- data.frame(tsne)
# pdb <- cbind(tsne,corals$Clade)
# options(warn = -1)
# fig <- plot_ly(data = pdb ,x = ~X1, y = ~X2, type = 'scatter', mode = 'markers', split = ~corals$Clade)
#
# fig <- fig %>%
# layout(
# plot_bgcolor = "#e5ecf6"
# )
#
# fig
corals[, -c(1,3,4,5,7,8) ] %>%
filter(species=="P") %>%
select(1, 3:12) %>%
GGally::ggpairs(aes(color = Clade)) -> ggpairs_species_P
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs_species_P
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggsave( "../plots/ggpairs_species_P.png", ggpairs_species_P, width = 14)
## Saving 14 x 5 in image
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
corals[, -c(1,3,4,5,7,8) ] %>%
filter(species=="S") %>%
select(1, 3:12) %>%
GGally::ggpairs(aes(color = Clade)) -> ggpairs_species_S
ggpairs_species_S
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggsave( "../plots/ggpairs_species_S.png", ggpairs_species_S, width = 14)
## Saving 14 x 5 in image
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.